dna profile
deepNoC: A deep learning system to assign the number of contributors to a short tandem repeat DNA profile
Taylor, Duncan, Humphries, Melissa A.
A common task in forensic biology is to interpret and evaluate short tandem repeat DNA profiles. The first step in these interpretations is to assign a number of contributors to the profiles, a task that is most often performed manually by a scientist using their knowledge of DNA profile behaviour. Studies using constructed DNA profiles have shown that as DNA profiles become more complex, and the number of DNA-donating individuals increases, the ability of scientists to assign the target number decreases. A number of machine learning algorithms have been developed that seek to assign the number of contributors to a DNA profile; however, due to practical limitations in being able to generate DNA profiles in a laboratory, the algorithms have been based on summaries of the available information. In this work we develop an analysis pipeline that simulates the electrophoretic signal of an STR profile, allowing virtually unlimited, pre-labelled training material to be generated. We show that by simulating 100 000 profiles and training a number-of-contributors estimation tool using a deep neural network architecture (in an algorithm named deepNoC), a high level of performance is achieved (89% for 1 to 10 contributors). The trained network can then be fine-tuned with only a few hundred profiles in order to achieve the same accuracy within a specific laboratory. We also build into deepNoC secondary outputs that provide a level of explainability to a user of the algorithm, and show how they can be displayed in an intuitive manner.
- Oceania > Australia > South Australia > Adelaide (0.14)
- Oceania > New Zealand (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- North America > United States > California > Los Angeles County > Santa Monica (0.04)
A novel application of Shapley values for large multidimensional time-series data: Applying explainable AI to a DNA profile classification neural network
Elborough, Lauren, Taylor, Duncan, Humphries, Melissa
The application of Shapley values to high-dimensional, time-series-like data is computationally challenging - and sometimes impossible. For $N$ inputs the problem is $2^N$ hard. In image processing, clusters of pixels, referred to as superpixels, are used to streamline computations. This research presents an efficient solution for time-series-like data that adapts the idea of superpixels for Shapley value computation. Motivated by a forensic DNA classification example, the method is applied to multivariate time-series-like data whose features have been classified by a convolutional neural network (CNN). In DNA processing, it is important to identify alleles from the background noise created by DNA extraction and processing. A single DNA profile has $31,200$ scan points to classify, and the classification decisions must be defensible in a court of law. This means that classification is routinely performed by human readers - a monumental and time-consuming process. The application of a CNN with fast computation of meaningful Shapley values provides a potential alternative to the classification. This research demonstrates the realistic, accurate and fast computation of Shapley values for this massive task.
- Research Report (0.64)
- Overview > Innovation (0.40)
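The superpixel idea in the Shapley-value abstract above can be illustrated with a minimal sketch: group the scan points of a 1-D signal into a small number of contiguous segments and compute exact Shapley values over segments rather than over individual points, so the cost is $2^S$ coalitions for $S$ segments instead of $2^N$ for $N$ points. This is not the authors' method; the function name `segment_shapley`, the equal-width segmentation, and the use of a baseline signal to "switch off" segments are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def segment_shapley(x, baseline, model, n_segments):
    """Exact Shapley values over contiguous segments of a 1-D signal.

    Hypothetical helper: segments are equal-width blocks of indices, and a
    segment that is absent from a coalition is replaced by `baseline` values.
    """
    n = len(x)
    # Equal-width contiguous blocks of scan-point indices (an assumption).
    bounds = [(i * n) // n_segments for i in range(n_segments + 1)]
    segments = [range(bounds[i], bounds[i + 1]) for i in range(n_segments)]

    def value(coalition):
        # Model output when only the segments in `coalition` carry real signal.
        masked = list(baseline)
        for s in coalition:
            for i in segments[s]:
                masked[i] = x[i]
        return model(masked)

    phi = [0.0] * n_segments
    for s in range(n_segments):
        others = [t for t in range(n_segments) if t != s]
        for k in range(len(others) + 1):
            # Standard Shapley weight for a coalition of size k.
            weight = (factorial(k) * factorial(n_segments - k - 1)
                      / factorial(n_segments))
            for subset in combinations(others, k):
                phi[s] += weight * (value(subset + (s,)) - value(subset))
    return phi


if __name__ == "__main__":
    signal = [1, 2, 3, 4, 5, 6, 7, 8]
    zeros = [0] * len(signal)
    # A toy additive "model": total signal height.
    print(segment_shapley(signal, zeros, sum, n_segments=4))
```

With a purely additive toy model such as `sum`, each segment's Shapley value is simply the sum of its own points, which makes the sketch easy to check by hand. A real classifier would be non-additive, and the choice of baseline and segment boundaries would then matter.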
Simulating realistic short tandem repeat capillary electrophoretic signal using a generative adversarial network
Taylor, Duncan, Humphries, Melissa
DNA profiles are made up from multiple series of electrophoretic signal measuring fluorescence over time. Typically, human DNA analysts 'read' DNA profiles using their experience to distinguish instrument noise, artefactual signal, and signal corresponding to DNA fragments of interest. Recent work has developed an artificial neural network, ANN, to carry out the task of classifying fluorescence types into categories in DNA profile electrophoretic signal. But the creation of the necessarily large amount of labelled training data for the ANN is time consuming and expensive, and a limiting factor in the ability to robustly train the ANN. If realistic, prelabelled, training data could be simulated then this would remove the barrier to training an ANN with high efficacy. Here we develop a generative adversarial network, GAN, modified from the pix2pix GAN to achieve this task. With 1078 DNA profiles we train the GAN and achieve the ability to simulate DNA profile information, and then use the generator from the GAN as a 'realism filter' that applies the noise and artefact elements exhibited in typical electrophoretic signal.
- Europe > Austria > Vienna (0.14)
- Oceania > Australia > South Australia > Adelaide (0.04)
- Oceania > New Zealand (0.04)
DNA mixture deconvolution using an evolutionary algorithm with multiple populations, hill-climbing, and guided mutation
Vilsen, Søren B., Tvedebrink, Torben, Eriksen, Poul Svante
DNA samples from crime cases analysed in forensic genetics frequently contain DNA from multiple contributors. These occur as convolutions of the DNA profiles of the individual contributors to the DNA sample. Thus, in cases where one or more of the contributors were unknown, an objective of interest would be the separation, often called deconvolution, of these unknown profiles. In order to obtain deconvolutions of the unknown DNA profiles, we introduced a multiple population evolutionary algorithm (MEA). We allowed the mutation operator of the MEA to exploit the fact that the fitness is based on a probabilistic model, guiding it by the deviations between the observed and the expected value for every element of the encoded individual. This guided mutation operator (GM) was designed such that the larger the deviation, the higher the probability of mutation. Furthermore, the GM was inhomogeneous in time, decreasing to a specified lower bound as the number of iterations increased. We analysed 102 two-person DNA mixture samples in varying mixture proportions. The samples were quantified using two different DNA prep. kits: (1) Illumina ForenSeq Panel B (30 samples), and (2) Applied Biosystems Precision ID Globalfiler NGS STR panel (72 samples). The DNA mixtures were deconvoluted by the MEA and compared to the true DNA profiles of the sample. We analysed three scenarios where we assumed: (1) the DNA profile of the major contributor was unknown, (2) the DNA profile of the minor contributor was unknown, and (3) both DNA profiles were unknown. Furthermore, we conducted a series of sensitivity experiments on the ForenSeq panel by varying the sub-population size, comparing a completely random homogeneous mutation operator to the guided operator with varying mutation decay rates, and allowing for hill-climbing of the parent population.
- Europe > Austria > Vienna (0.14)
- Europe > Denmark > North Jutland > Aalborg (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.85)
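The guided mutation operator described in the deconvolution abstract above can be sketched roughly as follows. The abstract states only that larger deviations give a higher mutation probability and that the operator decreases over time to a specified lower bound. The exponential decay, the scaling by the largest deviation, and every name here (`mutation_probabilities`, `guided_mutate`, `decay_rate`, `lower_bound`) are assumptions for illustration, not the authors' formulation.

```python
import math
import random


def mutation_probabilities(observed, expected, iteration,
                           decay_rate=0.01, lower_bound=0.05):
    """Per-element mutation probabilities guided by model deviations.

    Assumed form: absolute deviations are scaled by the largest deviation,
    then multiplied by a ceiling that decays exponentially from 1 towards
    `lower_bound`. No probability falls below `lower_bound`.
    """
    deviations = [abs(o - e) for o, e in zip(observed, expected)]
    largest = max(deviations) or 1.0  # avoid dividing by zero on a perfect fit
    ceiling = lower_bound + (1.0 - lower_bound) * math.exp(-decay_rate * iteration)
    return [max(lower_bound, ceiling * d / largest) for d in deviations]


def guided_mutate(individual, probabilities, alleles, rng):
    """Resample each element from `alleles` with its own mutation probability."""
    return [rng.choice(alleles) if rng.random() < p else gene
            for gene, p in zip(individual, probabilities)]


if __name__ == "__main__":
    rng = random.Random(0)
    observed = [10.0, 5.0, 0.0]   # hypothetical observed values per element
    expected = [10.0, 3.0, 4.0]   # hypothetical model expectations
    probs = mutation_probabilities(observed, expected, iteration=0)
    print(probs)
    print(guided_mutate([12, 14, 15], probs, alleles=[11, 12, 13, 14, 15], rng=rng))
```

Early in a run the worst-fitting element is mutated almost surely while well-fitting elements are mutated only at the floor rate. As the iteration count grows, all probabilities shrink towards the lower bound, which gives the time-inhomogeneous behaviour the abstract describes.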
'Secret sharing': Researchers say they've found way to better encrypt genetic data
WASHINGTON – Using nothing more than a simple vial of saliva, millions of people have created DNA profiles on genealogy websites. But this wealth of information is effectively inaccessible to genetics researchers, with the sites painstakingly safeguarding their databases, fearful of a leak that could cost them dearly in terms of credibility. This problem of access is one that Bonnie Berger, a professor of mathematics at Massachusetts Institute of Technology, and her colleagues think they can solve, with a new cryptographic system to protect the information. "We're currently at a stalemate in sharing all this genomic data," Berger told AFP. "It's really hard for researchers to get any of their data, so they're not really helping science. No one can gain access to help them find the link between genetic variations and disease," she said. "But just think what could happen if we could leverage the millions of genomes out there." The idea of this new cryptographic method, described ...
Drones and deliberation – Carl Rohde – Medium
The autonomous car is unthinkable without ML and DL. Sensors monitor everything that the cars come across during their drives. Based on those data, ML and DL accomplish their wondrous processing work. What I didn't know was that all those objects noticeable during the drives, all those data entries, are coded by real human beings, from frame to frame, and by real hands. A label for each tree passed by, each traffic sign, each threshold in front.
- North America > United States (0.05)
- Asia > China (0.05)
- Health & Medicine (0.98)
- Transportation > Ground > Road (0.35)
- Information Technology > Communications > Social Media (0.65)
- Information Technology > Data Science > Data Mining > Big Data (0.40)
- Information Technology > Artificial Intelligence > Machine Learning (0.40)
- Information Technology > Artificial Intelligence > Robots > Autonomous Vehicles (0.35)